ABSTRACT

We present an automated morphological classification into 4 types (E, S0, Sab, Scd) of ~700,000 galaxies from the SDSS DR7
spectroscopic sample based on support vector machines. The main new property of the classification is that we associate with
each galaxy a probability of being in each of the four morphological classes instead of assigning a single class. The classification
is therefore better adapted to nature, where we expect a continuous transition between different morphological types. The
algorithm is trained with a visual classification and then compared to several independent visual classifications, including the
Galaxy Zoo first-release catalog. We find a very good correlation between the automated classification and classical visual ones.
The compiled catalog is intended for use in different applications and is therefore freely available through a dedicated webpage*
and soon from the CasJobs database.

Key words. Catalogs, Astronomical databases, Galaxies: evolution, Galaxies: formation, Galaxies: fundamental parameters

1. Introduction

Classification of objects is a key step in understanding and 
analyzing an astrophysical sample. In particular, morphol- 
ogy is a powerful tracer of the structure of a galaxy. Since 
Hubble's first classification of galaxies according to their
shape (Hubble 1926), it has been shown that this phenomenological
description hides important physical differences between
galaxies and probably different evolutionary tracks.
Elliptical galaxies appear with old stellar populations, high 
velocity dispersion, and small fraction of gas while spiral 
galaxies are more gas-rich, with younger stellar populations 
whose motion is rotation dominated. 

The main problem with morphology comes from estima- 
tion, since, even when done through visual inspection, there 
are several intrinsic problems that can hardly be overcome. 
First, when one goes to high redshift, several new galaxies
appear that do not necessarily fit in the Hubble fork (e.g.
Abraham et al. 1994; Abraham et al. 1996; Conselice et al.
2008; Delgado-Serrano et al. 2010). Second, everybody
who has looked at galaxies in detail has realized how difficult
it is to classify them by eye, since there are lots of objects that
do not fall in a clear box (e.g. Postman et al. 2005). This becomes
even worse when other parameters are included, such
as colors or stellar dynamics. For example, Schawinski et al.
(2009) and Kannappan et al. (2009) have found a significant
fraction of elliptical galaxies with blue colors in the local
Universe. In the SAURON project (e.g. Emsellem et al.
2007), one of the main conclusions is that a significant fraction
of morphologically defined early-type galaxies present
features similar to late-type ones, such as rotation in their 
cores. The definition of an early or late type galaxy is con- 
sequently not very clear. What defines a given galaxy type? 
Is it just a shape and bulge fraction? or is it shape and stellar 
populations? or is it stellar dynamics? Almost eighty years af- 
ter Hubble's definition, these questions remain unanswered. 

It seems that, instead of being a closed definition, there 
is more like a continuous population of galaxies with some 
canonical objects, prototypes of elliptical, or spiral galaxies 
and then some galaxies that are more or less close to the defi- 
nition. Consequently, it makes more sense to assign distances 
or probabilities of being in one of the canonical classes in- 
stead of having a binary definition that is not necessarily very 
close to reality. 

In addition to these intrinsic issues, there are methodolog- 
ical problems as well because morphological classifications 
are, by definition, done by visual inspection. This job can be 
done on small samples but becomes an impossible task in very 
large surveys such as the SDSS, unless it is done through the 
aggregated efforts of hundreds of thousands of people over 
the course of many months, as for the Galaxy Zoo project
(Lintott et al. 2008, 2010).

Lots of effort has been made to try to determine morphology
in an automated and simple way by measuring
some parameters, such as concentration, asymmetry, clumpiness,
or Gini index (e.g. Abraham et al. 1996; Conselice et al.),
through 1D (Prieto et al. 2001; Trujillo et al. 2001) or 2D fitting
algorithms (e.g. Simard et al. 2002; Peng et al. 2002; de Souza et al. 2004;
Mendez-Abreu et al. 2008). More sophisticated classifications
include colors and color gradients (e.g. Neichel et al.
2008) or use neural networks (e.g. Ball et al. 2004); however,
all these methods deal with a finite number of classes and/or at
some point require a degree of human intervention. Moreover,
one can still argue that automated classifications are not real
morphological classifications, since we are just measuring parameters
of the light distribution while morphology is a much
more complex pattern recognition problem.

* http://gepicomQ4.obspm.fr/sdss\protect^orphology/M2fllfM^

In Huertas-Company et al. (2008, 2009b) we presented a
method based on support vector machines (galSVM). It was
initially designed for high-redshift galaxies, and it has the
advantage of dealing with an unlimited number of parameters
and assigning probabilities instead of binary classes. We
showed that, when applied to poorly resolved samples, it increases
the accuracy by a factor of ~3 compared to more
classical methods. The method has already been used and
validated in a variety of different cases on space- and ground-based
data to study, for instance, the fraction of blue early-type
galaxies in the field (Huertas-Company et al. 2010) and
the morphological mixing in clusters at intermediate redshift
(Huertas-Company et al. 2009a).

In this paper, we revisit the Hubble sequence in the SDSS
DR7 spectroscopic sample using this method and assign a
probability to each galaxy of being in the following morphological
classes: E, S0, Sab, Scd, instead of a closed class. The
paper proceeds as follows. In section 2 we describe the sample
used, and in section 3 the method employed for the classification
is presented in detail. We discuss the robustness of
the classification at the faint end in section 4, and a comparison
with a detailed visual classification of ~14000 galaxies
(Nair & Abraham 2010) and with the Galaxy Zoo first release
catalog (Lintott et al. 2010) is shown in section 5. Finally, we
show some examples of how to use this catalog in section 6.

2. The sample 

We used the whole SDSS DR7 spectroscopic sample as the
starting base. The selection of objects was then based
on Sanchez Almeida et al. (2010), who performed an unsupervised
automated classification of all the SDSS spectra.
Basically, we chose galaxies with redshift below 0.25 and
with good photometric data and clean spectra, meaning objects
that are not too close to the edges, not saturated, and properly
deblended. The final catalog contains 698420 objects for
which we estimate the morphology as shown below. No addi- 
tional selection criteria were added so that the catalog is not 
biased to any particular application. 



3. The method 

The classification method is based on support vector
machines (SVM) implemented in the libSVM library
(Chang & Lin 2001). SVM is a machine learning algorithm
that tries to find the optimal boundary (not necessarily linear)
between several clouds of points in an N-dimensional
space. More information about the algorithm can be found in
Huertas-Company et al. (2008). There are several interesting
properties that make this algorithm attractive for galaxy classification.
First, it can deal with an unlimited number of dimensions,
so that everything that is related to the classes one
would like to separate can be included in the classification
process. Second, it does not deliver a binary classification but
a probability of belonging to a given class. This probability
is related to the accuracy of the classification: the higher it is,
the higher the success rate (and so the closer the objects are
to the canonical classes), so that the accuracy of the classification
can be studied in an objective way. This property is
lacking in most of the existing classification schemes (especially
in the visual techniques).
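To illustrate how a margin-based classifier can return a probability instead of a hard label, the sketch below maps an SVM decision value to a probability with a sigmoid (Platt-style scaling, the general approach libSVM takes for its probability estimates). The function name and the coefficients A and B are hypothetical placeholders; in practice they would be fitted on the training sample, and this is not the galSVM code itself.

```python
import math

def decision_to_probability(f, A=-2.0, B=0.0):
    """Map an SVM decision value f to a probability with a sigmoid.

    A and B are illustrative placeholders (assumption), not fitted values.
    """
    z = A * f + B
    # Numerically stable evaluation of 1 / (1 + exp(z)).
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

# With A < 0, objects far on the positive side of the boundary approach
# probability 1, while objects on the boundary (f = 0, B = 0) get 0.5.
for f in (-2.0, 0.0, 0.5, 2.0):
    print(f, round(decision_to_probability(f), 3))
```

The point of the sketch is only that the distance to the boundary, not just its sign, carries information: galaxies close to the boundary end up with intermediate probabilities.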



3.1. Training sample 

The SVM method needs a training sample, and all the be- 
havior of the learning algorithm depends on how close this 
training sample is to the real sample one wants to classify. 
For morphological classification, the training sample is typ- 
ically built using a visually classified subsample. The prob- 
lem is that, usually, visual classifications are performed on the 
brightest objects because it is obviously easier, and one would 
like to go fainter in automated classifications. This causes 
a mismatch in the properties of the galaxies in the training 
sample and in the real sample, which can lead to misclassifications.
One solution, as shown in Huertas-Company et al.
(2008), is to simulate faint galaxies. In this paper, we decided
not to include any simulations, so as to be able to use the parameters
measured in the SDSS database and thus be consistent in
the way parameters were measured in the training and real
samples. The effects of such a (risky) decision are carefully
studied in sections 4 and 5. We therefore used the Fukugita et al.
(2007) classification as the training sample. In their paper,
they provide a visual classification of 2253 SDSS galaxies
brighter than m_r = 16 (compared to the full DR7 sample,
which goes up to m_r ~ 18). Since our goal is to classify
galaxies in 4 main classes (E, S0, Sab, Scd), we group them according
to their morphological index T (Fukugita et al. 2007,
Table 1): E: T < 1, S0: T = 1, Sab: 2 < T < 4, and Scd:
4 < T < 7 before using them for training the algorithm. We
included irregulars (T = 6) in the Scd class since there are
not enough objects in the local universe (and in particular in
the Fukugita et al. (2007) catalog) to make a separate class
for the training.



3.2. Procedure 

SVMs were originally designed to separate 2 classes. Some
implementations add multi-class separation, but the
accuracy is more difficult to assess. To avoid dealing with
multi-class problems, in this paper we proceeded in two steps.
First we separated the sample into two main classes, i.e. early-type
galaxies, which include ellipticals and S0 galaxies, and
late-type galaxies, which contain all the remaining morphological
types from Sa to Scd/Im. Then we took the whole
sample and classified it again using 2 different training sets
that contain only early-type and late-type galaxies, respectively
(see figure 1). The probability computed in this second
step can thus be seen as a conditional probability: "probability
of being S0 or E given that it is an early-type galaxy" and
"probability of being Sab or Scd given that it is a late-type
galaxy". With this approach we were certain to have a broad
classification in two types (which is enough for lots of science
applications) with a high success rate, and then a more
detailed one. Each galaxy in the catalog is therefore associated
with 6 probability values, i.e. the probability of being
in the two broad classes and the probability of being in the
4 subclasses. The 4 probabilities of the 4 subclasses can be
computed with Bayes' theorem using the conditional probabilities:

$P(\mathrm{E}) = P(\mathrm{Early}) \times P(\mathrm{E} \mid \mathrm{Early})$ (1)

$P(\mathrm{S0}) = P(\mathrm{Early}) \times P(\mathrm{S0} \mid \mathrm{Early})$ (2)

$P(\mathrm{Sab}) = P(\mathrm{Late}) \times P(\mathrm{Sab} \mid \mathrm{Late})$ (3)

$P(\mathrm{Scd}) = P(\mathrm{Late}) \times P(\mathrm{Scd} \mid \mathrm{Late})$ (4)

We considered in the 4 equations above that
P(Early | E) = P(Early | S0) = P(Late | Sab) =
P(Late | Scd) = 1. Following these equations, we obviously
have P(Early) = P(E) + P(S0), P(Late) =
P(Sab) + P(Scd), and P(E) + P(S0) + P(Sab) + P(Scd) = 1.

Fig. 1: Schematic view of the procedure used to classify the sample
and the probabilities measured in each step (P(Early), P(Late), P(Sab | Late), P(Scd | Late)).
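Equations (1)-(4) can be sketched in a few lines. The function and argument names are illustrative, and the sketch assumes that each two-class step returns complementary probabilities, i.e. P(Late) = 1 - P(Early), P(S0 | Early) = 1 - P(E | Early), and P(Scd | Late) = 1 - P(Sab | Late).

```python
def combine_probabilities(p_early, p_e_given_early, p_sab_given_late):
    """Combine the broad and conditional probabilities into 4 class probabilities.

    Assumes complementary outputs in each two-class step (see lead-in).
    """
    p_late = 1.0 - p_early
    return {
        "E": p_early * p_e_given_early,
        "S0": p_early * (1.0 - p_e_given_early),
        "Sab": p_late * p_sab_given_late,
        "Scd": p_late * (1.0 - p_sab_given_late),
    }

# Example with made-up values for a single galaxy.
probs = combine_probabilities(p_early=0.7, p_e_given_early=0.6, p_sab_given_late=0.8)
print(probs)
print(sum(probs.values()))  # the four probabilities sum to 1 up to rounding
```

Under this assumption the normalization P(E) + P(S0) + P(Sab) + P(Scd) = 1 holds by construction.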

3.3. Parameters used 

The SDSS database contains lots of photometric and spectroscopic
parameters that are related to the morphological
type of the galaxy and could hence be used for the classification.
One interesting property of SVMs is that they are
not degenerate, in the sense that adding extra parameters
does not lead to a decrease in the classification accuracy
(Huertas-Company et al. 2008) even if they do not bring any
extra information. However, the computing time increases
and the parameter space is less well sampled if too many parameters
are included. After several tests, we decided to include
three types of parameters: (1) color (g-r, r-i), k-corrected
with the Blanton et al. (2005) code, (2) shape (isoB/isoA in the
i-band and deVAB_i), and (3) light concentration (R90/R50
in the i-band). For color measurements we use model magnitudes
corrected for galactic extinction. isoB and isoA are the
isophotal minor and major axes, respectively, and deVAB_i is
the de Vaucouleurs fit b/a. R90 and R50 are the radii containing
90% and 50% of the Petrosian flux, respectively. Adding
more parameters does not significantly change the classification
and increases the execution time. The decision to include
the color could be discussed, since, as pointed out in the introduction,
it is not clear how an early-type or a late-type galaxy
is actually defined. Since our approach is to define classes
as closely as possible to the canonical definition and then
compute distances to them, it makes sense to include color.
Indeed, for an elliptical to be elliptical it should be red; otherwise
it should be called a blue elliptical, and it is an exception
to the normal classification. Either way, tests performed
reveal that removing the color from the parameter space does
not significantly change the classification: fewer than 10% of
the galaxies change their main morphological class. In figure 2
we show the 4 probabilities as a function of some representative
parameters used in the classification. We observe
some obvious correlations: the probability of being elliptical
increases with concentration, and redder galaxies have
higher probabilities of being ellipticals. The correlations are
less clear for intermediate classes (S0 and Sab). One important
conclusion from these plots is that one single parameter
is not enough to select galaxies with high probability
of being in a given class. For instance, it is common to use a
concentration threshold R90/R50 > 2.6 (in the r-band) to select
elliptical galaxies (e.g. Bell et al. 2003; Kauffmann et al.
2003). As shown in the top panel of figure 2, this selection results
in a significant fraction of galaxies with low probabilities
of being elliptical galaxies (as also shown in Bernardi et al.
2010b).

4. Robustness 

4.1. Accuracy at the faint end 

As pointed out in section 3, there is a critical point in our
approach, since the classified sample contains lots of galaxies
fainter than the limiting magnitude of the training sample.
Therefore, it is very important to check that these faint galaxies
are not systematically misclassified just because they are
not represented in the training. As a first check, we computed
the probability distributions of bright (m_g < 16) and faint
(m_g > 16) galaxies in figure 3 to check whether faint galaxies
are systematically classified with lower probabilities. As
shown in Huertas-Company et al. (2008), the probability
is a measure of how good the classification is and
how close a given galaxy is to the corresponding associated
class. Low probabilities in all the classes would consequently
mean that the galaxy is not close to any of the classes of
the training, i.e. that faint galaxies are not
properly classified because they are not properly sampled
in the training set. We observe in figure 3 that there is no
evident difference between the two probability distributions.
A Kolmogorov-Smirnov test gives between 55% and 99%
probability that the 2 distributions are drawn from the same
parent distribution, so we find no evidence that the 2 distributions
differ. The probability values seem to be
quite independent of the galaxy brightness, at least up to the
magnitude limit of the sample. The algorithm is thus able to
find a clear closest class even for the faintest objects, which
supports the robustness of the classification.
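As a minimal sketch of the kind of comparison described above, the two-sample Kolmogorov-Smirnov statistic is the largest absolute difference between the empirical cumulative distributions of the bright and faint samples. The function name and the toy inputs are illustrative; converting the statistic into a significance level requires the Kolmogorov distribution, which is not included here.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d_max = 0.0
    # The ECDFs only change at data points, so it is enough to test those.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max

# Toy probability values for a "bright" and a "faint" subsample (made up).
bright = [0.10, 0.35, 0.60, 0.85, 0.95]
faint = [0.15, 0.30, 0.65, 0.80, 0.90]
print(ks_statistic(bright, faint))
```

A small statistic means the two probability distributions are hard to tell apart, which is the behavior expected if faint galaxies are not systematically pushed toward low probabilities.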

As a second check, we looked at some of the images of
the faint end of the sample (Fig. 4). We confirm that high
probability values for a given morphological class still correspond
to galaxies that closely resemble galaxies in this
class, independently of the magnitude. It therefore seems that
the classification is robust even for the faintest objects in the
sample and that no major misclassifications are evident. In
section 5 we perform a detailed comparison with a visual classification
of faint objects.

Fig. 2: Distribution of the main parameters used (concentration (r90/r50), axis ratio (b/a), and color (g-r)) as a function of probability. All
parameters are measured in the i band (see text).

4.2. Dependence on the training set 

Another important point that should be studied is the effect 
of changes in the training set on the final classification. In 
fact, a robust classification should not change significantly if 
some elements are removed from the training sample. On the 
contrary, if removing some elements leads to a completely 
different classification, it means that the parameter space is
not properly sampled and therefore the classification is very
unstable. To check this point, we performed 10 different classifications
with slightly different training sets. The samples
were generated by randomly selecting a subset of 500 galaxies
from the Fukugita et al. (2007) sample. We then compared
the different classifications in terms of probability. These 10
runs on the full data set take only a few minutes on a normal
laptop. The average scatter over the 10 runs of the probability
of being early-type (or late-type) is 12%. In other words,
when one changes the training set, the probability for a given
galaxy changes by ~12% on average. This 12% scatter is compatible
with, and even less than, the typical scatter found when several
people classify the same sample
(e.g. Postman et al. 2005; Fukugita et al. 2007).

Fig. 4: Examples of galaxies with their computed probability values
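The stability test above can be sketched as follows. This is only an illustration: the text does not specify how the scatter is computed, so the sketch assumes it is the standard deviation of P(Early) across runs, averaged over galaxies. All names are hypothetical, and the classifier itself is not included.

```python
import random
import statistics

def draw_training_subsets(training_ids, n_runs=10, size=500, seed=0):
    """Draw n_runs random training subsets (without replacement within a run)."""
    rng = random.Random(seed)
    return [rng.sample(training_ids, size) for _ in range(n_runs)]

def mean_scatter(prob_runs):
    """Average over galaxies of the run-to-run standard deviation.

    prob_runs[k][i] is P(Early) of galaxy i in run k (assumed layout).
    """
    n_gal = len(prob_runs[0])
    per_galaxy = [
        statistics.pstdev([run[i] for run in prob_runs]) for i in range(n_gal)
    ]
    return statistics.mean(per_galaxy)

# 10 subsets of 500 objects drawn from a training sample of 2253 objects.
subsets = draw_training_subsets(list(range(2253)))
print(len(subsets), len(subsets[0]))

# Toy probabilities from two runs for two galaxies (made-up values).
print(mean_scatter([[0.8, 0.2], [0.6, 0.2]]))
```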



4.3. Uncertain objects 

Another way of assessing the robustness of the classification 
is by measuring the fraction of objects whose classification 
is uncertain. If this fraction appears to be too high it would 
imply that the algorithm is not working for a large fraction of 
the sample. We define uncertain objects as those for which 
the difference between the maximum and the minimum prob- 
ability value is less than 0.15; i.e. the four probabilities are in 
a range less than 0.15, so the galaxy does not clearly fit in any 
of the four morphological classes. 
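The criterion for uncertain objects translates directly into a one-line test. The function name is illustrative; the 0.15 threshold is the one quoted in the text.

```python
def is_uncertain(probs, max_range=0.15):
    """True if the spread between the largest and smallest probability is below max_range."""
    return max(probs) - min(probs) < max_range

# Order of the probabilities: P(E), P(S0), P(Sab), P(Scd); values are made up.
print(is_uncertain([0.30, 0.25, 0.25, 0.20]))  # nearly flat distribution
print(is_uncertain([0.70, 0.20, 0.05, 0.05]))  # clearly dominated by one class
```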

There are 3013 objects satisfying this condition, 0.4% of
the whole sample. The vast majority of the objects are therefore
close to one (or two) morphological classes and very few
are in an uncertain region. A visual inspection of these galaxies
(fig. 5) reveals that they are small, compact, and/or disturbed
objects, for which the visual morphology is also difficult
to assess. They are not, however, particularly distant or
faint objects, since the magnitude and redshift distributions are
compatible with those of the full sample.

5. Comparison with visual classifications 

5.1. Comparison with Nair & Abraham 2010 

One obvious validation check of the classification is to compare
it with existing visual classifications. As explained in
previous sections, we used the Fukugita et al. (2007) catalog
for training. It is therefore better to use a different, independent
subsample for testing the accuracy and robustness of the
classification. In a recent paper, Nair & Abraham (2010) published
a very detailed visual catalog of 14034 galaxies in the
SDSS with m_g < 16. Galaxies in this sample are included
in our classification, but most of them have not been used
to build our training sample, so they represent an ideal independent
cross check. Since the Nair & Abraham (2010) classification
is much more detailed than ours, we group their
classes into 4 groups matching the 4 classes we have defined
in this work. We consider as ellipticals objects with
TType = -5, as S0s those with TType = -2, as Sabs those with 1 < TType < 3, and
finally as Scds those with 5 < TType < 10 (see table 1 of Nair & Abraham
2010 for a definition of the TType index used in their work).

Fig. 3: Probability distributions of bright (m_g < 16, red dotted line)
and faint (m_g > 16, black solid line) galaxies in the sample. The 4
panels show the 4 computed probabilities as indicated in the x-axis
labels.

Fig. 5: Examples of uncertain classifications as defined in the text.



Figure 6 shows the probability distributions of these 4 groups.
Globally, we observe a good correlation between the probability
values and the visual class. For example, galaxies visually
classified as ellipticals have on average a probability of
~0.8 of being ellipticals and ~0.2 of being S0. The two other
probabilities are almost zero. It is well known
that it is very difficult to separate S0 galaxies by eye. This
is reflected in the probability distributions, which are more
uniform than for the pure elliptical class. A galaxy visually
classified as S0 has on average ~0.4 probability of being
S0 but also ~0.32 of being elliptical and 0.2 of being Sab,
which reflects the difficulty of defining the S0 class and the
fact that these galaxies are indeed a transition class in terms
of morphology between the ellipticals and the spirals. A similar
effect is seen in the Sab population, which has on average
a probability of ~0.55 of being Sab but also ~0.15 of being
S0 or Scd. Another interesting measurement is the fraction
of catastrophic classifications, i.e. galaxies whose automated
and visual classes are completely different. We define those
cases as objects for which P(E) > 0.8 and TType > 5 or
P(Scd) > 0.8 and TType = -5, i.e. galaxies that are clearly
elliptical (Scd) for our algorithm and visually classified as Sc
or later (elliptical). There are only 2 objects satisfying these
conditions, and both are in the first case. They are indeed spiral
galaxies, so the algorithm is wrong, but both have a large
red bulge, which can probably account for the misclassification.
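The definition of a catastrophic classification can be written as a simple predicate. The function name and argument layout are illustrative; the thresholds are the ones quoted in the text.

```python
def is_catastrophic(p_e, p_scd, ttype, p_min=0.8):
    """True if the automated and visual classes are at opposite ends.

    Case 1: confidently elliptical for the algorithm, visually later than Sc.
    Case 2: confidently Scd for the algorithm, visually elliptical (TType = -5).
    """
    return (p_e > p_min and ttype > 5) or (p_scd > p_min and ttype == -5)

# Made-up examples.
print(is_catastrophic(p_e=0.9, p_scd=0.0, ttype=6))   # case 1
print(is_catastrophic(p_e=0.9, p_scd=0.0, ttype=-5))  # consistent classification
```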

5.2. Galaxy Zoo 

Recently, the Galaxy Zoo¹ team (Lintott et al. 2010) has made
publicly available the visual classification of the full DR7, performed
through the aggregated efforts of hundreds of thousands
of people over the course of many months. This work
is an extraordinary effort (and probably the only way) to visually
classify present and future extremely large surveys. The
main drawback, however, is that it requires plenty of time
(more than 2 years in this case) to collect all the information
and put all the catalogs in place. It is therefore
interesting to see how our automated classification
behaves compared to this visual classification. Our classification
is indeed much faster and can be run several times with
different parameters in just a few minutes, but it is not obvious
whether we can reach an accuracy similar to that of the human
brain. Moreover, this comparison also covers
the faint end of the sample (since the Galaxy Zoo catalog
contains all galaxies), and hence provides a new evaluation of the effect of
lacking faint objects in the training sample (see section 4).
The classification made in the framework of the Galaxy Zoo
is less detailed than a pure visual classification, such as the
one from Nair & Abraham (2010) or Fukugita et al. (2007);
i.e., they basically asked people if the galaxy is elliptical-like
(which should include S0s) or spiral-like (with different
subcategories like clockwise or anti-clockwise rotation), but
without submorphological types. Galaxy Zoo 2² and Hubble
Zoo³ will furnish more detailed classifications in the near
future but are not publicly available for the moment. The confidence
of the classification in the current release is measured
by the fraction of votes received, since each galaxy is classified
by several people. A galaxy is then flagged as early-type
or spiral-like if the fraction of votes in one of those categories
is greater than 80%. In figure 7 we show the probability
distribution obtained with the galSVM classification
for galaxies flagged as elliptical-like (flag ELLIPTICAL = 1)
and spiral-like (flag SPIRAL = 1), respectively. We observe an
extremely good correlation between both classifications even
for faint galaxies not necessarily well represented in the training
set, as discussed in § 4. Galaxies flagged as ellipticals in
the Galaxy Zoo catalog have a median probability of 0.92 of
being elliptical or S0, and the same holds for galaxies classified as
spirals. This means that robust classifications in Galaxy Zoo
are also very sure classifications in our catalog; however, the
fraction of galaxies without a clear morphological type (i.e.
the fraction of votes is less than 80%, so they lie somewhere
between a pure early-type and a pure late-type galaxy) in the Galaxy
Zoo is relatively high (~60%), so it is interesting to check
where all these remaining galaxies fall.

1 http://galaxyzoo.org/
2 http://zoo2.galaxyzoo.org/
3 http://hubble.galaxyzoo.org/

Fig. 6: Probability distributions of the 4 morphological types considered in this work for 4 visual types (TType) from Nair & Abraham
(2010). Each panel shows a different visual type. Top left panel: galaxies with TType = -5 (ellipticals); top right panel:
TType = -2 (S0s); bottom left panel: 1 < TType < 3 (Sabs); bottom right panel: 5 < TType < 10 (Scd). Red short dashed lines
are P(E), orange dashed dotted lines are P(S0), green dashed three dotted lines are P(Sab), and blue long dashed lines are P(Scd). See text
for details of how these 4 probabilities are computed.

For that purpose, we push the comparison a bit further.
Since the quality of the classification in the
Galaxy Zoo is measured by the number of votes, another interesting
test is to compare our probability measurement to
the fraction of votes. In other words: does the probability
measurement reflect the choice of the majority? We indeed
expect to find a correlation, since confident classifications in
terms of votes should also be galaxies close to the canonical
definition, hence objects with high probability values.
This comparison is shown in fig. 8. There is significant scatter,
but we observe 2 clear clouds. Objects with a high fraction
of votes for being elliptical have high probability values and
vice versa. The same behavior is measured for spirals. When
we average the fraction of votes per probability bin, the correlation
becomes clearer, and we find that there is a monotonic
relation between the fraction of votes and the probability
(Fig. 8). This fact confirms that our probability measurement
indeed measures the robustness of the classification for
a given object.
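Averaging the fraction of votes per probability bin can be sketched as below. The function name, the number of bins, and the equal-width binning on [0, 1] are assumptions made for illustration; the text does not state the binning actually used.

```python
def binned_mean(prob, votes, n_bins=10):
    """Mean vote fraction in equal-width probability bins on [0, 1].

    Returns one value per bin, or None for empty bins.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, v in zip(prob, votes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[k] += v
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Made-up probabilities and vote fractions for four galaxies.
print(binned_mean([0.05, 0.07, 0.95, 1.0], [0.1, 0.3, 0.9, 1.0]))
```

A monotonic trend in the returned list would correspond to the behavior described above, with the caveat that empty bins have to be skipped when plotting.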

In figures 9 and 10 we compare the fraction of votes with
the 4 more detailed probabilities computed in this work
(P(E), P(S0), P(Sab), and P(Scd)). We again find a clear correlation
between the number of votes given by people and the
probability computed in an automated way by galSVM.



6. How to use the catalog? 

The most important new point of the classification presented
in this work is the measurement of probabilities. A
morphological class is therefore not defined as a closed box;
there is instead a continuous transition from one class to another.
How can this new property be used for selecting a particular
population and studying its properties? If one wants
to compute luminosity or mass functions for a given morphological
type, the optimal way (in terms of optimal estimation)
is to use the probability measure as a weight for the
galaxy counts. As shown in Huertas-Company et al. (2009b),
we can define a random variable Y_Type:

$Y_{\mathrm{Type}} = \begin{cases} 0 & \text{with a probability } 1 - P_{\mathrm{Type}} \\ 1 & \text{with a probability } P_{\mathrm{Type}} \end{cases}$

This way, the number of galaxies of a given morphological
type in a mass or luminosity bin is simply given by its mathematical
expectation,

$N_{\mathrm{Type}} = \sum_{N_{\mathrm{obj}}} P_{\mathrm{Type}}$ (5)

and the 1-σ error is the square root of the variance:

$\sigma_{\mathrm{Type}} = \sqrt{\sum_{N_{\mathrm{obj}}} P_{\mathrm{Type}} \times (1 - P_{\mathrm{Type}})}$ (6)

All the galaxies contribute to the mass function of a given
morphological type, each weighted by its probability. As a result, a
galaxy that is 95% Scd and 0.5% E will still contribute to the
mass function of elliptical galaxies with a weight of 0.005.
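The probability-weighted counts and their error can be sketched as follows, treating each galaxy in a bin as an independent 0/1 variable with success probability P_Type. The function name and the input values are illustrative.

```python
import math

def expected_count(probs):
    """Expected number of galaxies of a type in a bin, and its 1-sigma error.

    probs holds P_Type for every galaxy falling in the bin; galaxies are
    assumed independent, so the variances p * (1 - p) simply add.
    """
    n = sum(probs)
    sigma = math.sqrt(sum(p * (1.0 - p) for p in probs))
    return n, sigma

# Made-up P(E) values for the galaxies in one mass bin.
n_e, sigma_e = expected_count([0.95, 0.80, 0.10, 0.005])
print(n_e, sigma_e)
```

Galaxies with probabilities close to 0 or 1 add almost nothing to the error, while galaxies near 0.5 dominate it.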

Another approach is to make probability cuts. This way,
we decide that galaxies belong to a given class by applying
a probability threshold. This approach (even if not optimal)
should be closer to the classical approach from visual classifications,
in which each galaxy contributes to only one class.
The threshold to apply depends on the application. For example,
it is interesting to determine which threshold is the
best to get distributions similar to those from visual classifications.
In figure 12 we compare the two estimations of the
observed distribution of stellar masses with the ones obtained
from the visual classification of Nair & Abraham (2010). We
use a threshold of P_Type > 0.45 in each type and obtain similar
distributions for all morphological types. Stellar masses
are taken from the Nair & Abraham (2010) catalog, which in turn takes them
from the Kauffmann et al. (2003) estimates.



In figure 11 we show the observed distribution of stellar
masses for the whole sample for different morphological
types using the probability estimator. In this case, stellar
masses are computed with the Bell et al. (2003) formula,
adapted from Bernardi et al. (2010b) to account for evolution:

$\log_{10}(M_{*}/M_{\odot}) = 1.097\,(g-r) - 0.406 - 0.4\,(M_{r} - 4.67) - 0.19\,z$ (7)

We observe the expected trend; i.e., the mass function peaks
at lower values for later morphological types. In the same figure,
we compare the distribution of masses obtained from the
Galaxy Zoo classification. We compare the one obtained with
galaxies flagged as ellipticals (FLAG ELLIPTICAL = 1) with
the one obtained using the two estimators described above,
i.e. galaxies having P(E) > 0.5 and probability weighting.
The same is computed for spirals. There is an almost perfect
match with the distributions computed using galSVM, which
again confirms the accuracy of the automated classification
presented in this paper.
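Equation (7) is straightforward to evaluate. In the sketch below the function and argument names are illustrative, and it is assumed that g-r is the k-corrected color and M_r the r-band absolute magnitude; the input values are made up.

```python
def log_stellar_mass(g_minus_r, abs_mag_r, z):
    """log10 of the stellar mass in solar units from color, absolute magnitude, and redshift."""
    return 1.097 * g_minus_r - 0.406 - 0.4 * (abs_mag_r - 4.67) - 0.19 * z

# A red galaxy with g-r = 0.8, M_r = -21 at z = 0.1 (made-up values).
print(log_stellar_mass(0.8, -21.0, 0.1))
```

At fixed color and redshift, one magnitude in M_r corresponds to 0.4 dex in stellar mass, as expected for a fixed mass-to-light ratio.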

Another common application is to study the color-stellar 
mass diagrams for different "robust" morphological types. 
Again, the probability estimator can be used by computing 
the 2D histogram of galaxies in the color-mass plane 
weighted with the probabilities. Figure 13 shows the 
probability contours in the color-stellar mass plane for the 4 
morphological types. We observe the expected trend: elliptical 
and S0 galaxies are redder with less scatter, while Sab and 
Scd are bluer. An interesting feature of Sab galaxies (and 
of some Scd) is that there seem to be 2 distinct populations: 
one red population and another one lying in the so-called 
green valley between the blue cloud and the red sequence. 
After careful visual inspection of a significant fraction of 
these red galaxies, we can confirm that most of them are in 
fact edge-on spirals probably reddened by dust. A small 
fraction are, however, real passive spirals, as shown and 
carefully studied by Masters et al. (2010a,b). Most of them 
are classified as Sab galaxies with high probability (see 
figure 14). This result confirms that a pure color selection 
is not enough to select ellipticals or S0 galaxies, since it is 
highly contaminated by edge-on spirals, as already shown in 
previous works (e.g. Schawinski et al. 2007; Lintott et al. 
2008; Bernardi et al. 2010b). 
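
The probability-weighted 2D histogram described above can be sketched as follows. The grid cell sizes, the toy galaxy values, and the function name `weighted_hist2d` are assumptions made for illustration; contours such as those in the figure would then be drawn from the resulting grid.

```python
import math
from collections import defaultdict

def weighted_hist2d(x, y, weights, x_width, y_width):
    """Sum `weights` in the cells of a regular grid.

    Keys are (ix, iy) integer cell indices; values are summed weights."""
    grid = defaultdict(float)
    for xi, yi, w in zip(x, y, weights):
        cell = (math.floor(xi / x_width), math.floor(yi / y_width))
        grid[cell] += w
    return dict(grid)

# Toy values (hypothetical): log stellar mass, g-r color and p(Sab)
# for four galaxies.
log_mass = [10.2, 10.3, 10.9, 10.9]
color = [0.45, 0.48, 0.82, 0.55]
p_Sab = [0.70, 0.60, 0.20, 0.65]

# Each galaxy adds p(Sab) to its cell instead of a full count.
density = weighted_hist2d(log_mass, color, p_Sab, x_width=0.5, y_width=0.1)
```

A galaxy with a low p(Sab), such as the red one in this toy sample, still appears in the Sab plane but carries little weight, which is what keeps the contours dominated by robust members of the class.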

These plots are just shown here to validate the morpho- 
logical classification. A more detailed analysis of the funda- 
mental parameters of galaxies is expected to come in future 
dedicated papers. 



7. Summary and conclusions 

We have presented an automated morphological classification 
of the SDSS DR7 spectroscopic sample. The algorithm used 
is based on SVM, and the most interesting and new property 
is that it associates a probability value to each galaxy instead 
of a single class. This way, the transition between one class 
and another is continuous, which should be a better approx- 
imation to nature and to visual classifications. As a matter 
of fact, when the brain decides which morphological class is 
closer to a given object we are looking at, it probably also 
implicitly measures some parameters and computes distances 
in this virtual parameter space to decide which one is the 
closest canonical class to the object it is classifying. In that 
sense, even if the list of parameters we measure is reduced 
Fig. 8: Probability values computed with galSVM compared to the fraction of votes for ellipticals (left panel, horizontal axis P(E) + P(S0)) and spirals (right panel, horizontal axis P(Sab) + P(Scd)). 
Gray scales are scaled to the data, i.e. white is maximum and black is minimum. Lines show the average fraction of votes in 0.05 probability 
bins. 



and much more simplistic than what our brain can do (e.g., we 
are not including spiral arms nor tidal features that certainly 
play an important role in a visual classification), the spirit 
of our approach is closer to a classical visual classification 
than other existing automated methods. The results obtained 
are in good agreement with existing visual classifications 
and are robust even at the faint end of the sample. The main 
advantage of this approach is that it is fast (a few minutes 
on a regular laptop) and reproducible. Moreover, we obtain 
a classification into 4 morphological types instead of the 2 
obtained in the Galaxy Zoo. The probability measurements 
can be used as a weighting factor for computing statistical 
quantities, such as luminosity or mass functions, or as 
a selection criterion to be sure that a cleaned sample of 
galaxies is selected. The classification is intended for use in 
many different applications and is therefore freely available at 
http://gepicom(84.obspm.fr/sdss_morphology/Morphi 
and soon from the CasJobs database. In subsequent papers, 
the classification will be used to compare spectroscopic 
and morphological classifications and investigate possible 
transitions in color-mass space (Sanchez-Almeida et al. in 
preparation) and to study the morphological properties of 
galaxies around BCGs (Bernardi et al. in preparation). 

Acknowledgements. The authors are grateful to F. Hammer for reading the 
Fig. 9: Comparison between the fraction of votes for a galaxy to be elliptical-like from Galaxy Zoo and the computed probabilities in this 
work. Gray scales are scaled to the data, i.e. white is maximum and black is minimum. Solid line shows the average relation. The average is 
computed in 0.05 probability bins. 

Fig. 10: Comparison between the fraction of votes for a galaxy to be spiral-like from Galaxy Zoo and the computed probabilities in this 
work. Gray scales are scaled to the data, i.e. white is maximum and black is minimum. Solid line shows the average relation. The average is 
computed in 0.05 probability bins. 

Fig. 11: Observed distribution of masses for different morphological types computed using the different estimators described in the text. 
In the left panel the whole sample is shown using the probability weighting. Red short dashed line: ellipticals; yellow dashed 
dotted line: S0s; green dashed three dotted line: Sabs; blue long dashed line: Scds. In the right panel, we show galaxies flagged as SPIRAL 
and ELLIPTICAL in the Galaxy Zoo. Red solid lines are galaxies flagged as ellipticals in Galaxy Zoo (FLAG ELLIPTICAL = 1), red dashed 
line is the distribution obtained using probability weighting and red dots are galaxies with p(E) > 0.5. Blue solid lines are galaxies flagged 
as spirals (FLAG SPIRAL = 1) in the Galaxy Zoo, blue dashed line is the distribution obtained using probability weighting and blue dots 
are galaxies with p(Sab) + p(Scd) > 0.5. 



Table A.1: First 10 objects in the catalog. Columns are: id, identification number; SpecObjId, id from the SDSS spectroscopic catalog; RA, 
right ascension; DEC, declination; z, redshift from the SDSS database; p(Early), probability of being early-type (E or S0); p(E), probability 
of being elliptical; p(S0), probability of being S0; p(Sab), probability of being Sa or Sb; p(Scd), probability of being Sc or Sd; and ask_class, 
the spectral class from Sanchez Almeida et al. (2010). 

id  SpecObjId         RA           DEC         z      p(Early)  p(E)   p(S0)  p(Sab)  p(Scd)  ask_class
1   7509409297491...  146.7441406  -0.6522176  0.203  0.941     0.790  0.150  0.032   0.026   2.0
2   7509409298330...  146.6285706  -0.7651463  0.064  0.145     0.023  0.121  0.641   0.213   0.0
3   7509409301266...  146.9341278  -0.670413   0.121  0.969     0.861  0.108  0.016   0.013   0.0
4   7509409301685...  146.9638977  -0.5450143  0.056  0.061     0.011  0.049  0.440   0.498   10.0
5   7509409302105...  146.9635162  -0.7593367  0.09   0.802     0.169  0.632  0.135   0.062   3.0
6   7509409302524...  146.9499969  -0.5922154  0.064  0.120     0.020  0.100  0.762   0.116   10.0
7   7509409303363...  146.8598328  -0.8089029  0.126  0.834     0.038  0.796  0.089   0.076   1.0
8   7509409303783...  146.5927277  -0.7602585  0.064  0.188     0.026  0.161  0.618   0.193   9.0
9   7509409304202...  146.8576965  -0.6628734  0.084  0.004     0.001  0.003  0.451   0.543   9.0
10  7509409304621...  146.727951   -0.5568492  0.089  0.939     0.721  0.217  0.031   0.029   0.0
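
A simple sanity check when reading the catalog is sketched below, using the first three rows of Table A.1. In the tabulated values, p(Early) appears to equal p(E) + p(S0) and the four class probabilities appear to sum to one, both within rounding. The tolerance of 0.005 and the function name `check_row` are assumptions made for illustration.

```python
# First three rows of Table A.1: (p_early, p_E, p_S0, p_Sab, p_Scd)
rows = [
    (0.941, 0.790, 0.150, 0.032, 0.026),
    (0.145, 0.023, 0.121, 0.641, 0.213),
    (0.969, 0.861, 0.108, 0.016, 0.013),
]

def check_row(p_early, p_E, p_S0, p_Sab, p_Scd, tol=0.005):
    """Return True if the row is self-consistent within `tol`.

    `tol` is an assumed allowance for values rounded to three decimals."""
    early_ok = abs(p_early - (p_E + p_S0)) <= tol
    norm_ok = abs((p_E + p_S0 + p_Sab + p_Scd) - 1.0) <= tol
    return early_ok and norm_ok

all_ok = all(check_row(*row) for row in rows)
```

Rows failing such a check in a downloaded copy of the catalog would more likely point to a parsing problem (for example, shifted columns) than to the classification itself.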

Fig. 12: Observed distribution of masses for different morphological types in the Nair & Abraham (2010) sample using different estimators 
(see text for details). Black solid lines: visual classification; red filled circles: probability cuts; red dashed line: probability estimates. Each 
panel shows a visual morphological class from Nair & Abraham (2010), selected as described in the text. For the probability cuts, we use 
P > 0.45 in this given type. 

Fig. 13: Color magnitude relation for the 4 morphological types. Contours are computed by probability weighting. For reference, we show 
in the 4 panels the best fit to the elliptical red sequence from Bernardi et al. (2010a). Top left panel: ellipticals; top right panel: S0s; bottom 
left panel: Sab galaxies; bottom right panel: Scd galaxies. 